Simulating the motion of charged particles through accelerator elements is
necessary for understanding modern particle accelerators. The number of
particles and the number of turns in a synchrotron are huge, so a simulation
can be time-consuming. Open multi-processing (OpenMP) is a convenient way to
speed up computation on multi-core computers based on the shared-memory model.
Using the message passing interface (MPI), which is based on a non-uniform
memory access architecture, a coarse-grained parallel algorithm is set up for
the dynamic tracking processes of the Accelerator Toolbox (AT). The tracking
process is sped up by a factor of 3.77 on a quad-core CPU computer, and the
speedup grows almost linearly with the number of CPUs.
I. INTRODUCTION


The Accelerator Toolbox (AT) code was developed at the Stanford Synchrotron
Radiation Lightsource (SSRL) to model particle accelerators and beam transport
lines in the MATLAB environment [1]. Users of AT can develop their own
functions and applications by building on the AT source code. The AT results
agree well with experimental measurements. AT has been used at the Shanghai
Synchrotron Radiation Facility (SSRF) for several years [2, 3]. In storage
ring simulations, dynamic aperture tracking and lattice design optimization
are computationally intensive. To finish the computation in an acceptable
time, the code must be made more efficient. The bottleneck in AT performance
is the massive number of repeated calls to particle tracking functions, which
are independent of each other. Parallel computing is therefore a
straightforward way to speed up the computation [4].

There are two common ways to parallelize a program: computing on a graphics
processing unit (GPU), and running multiple threads on a multi-core CPU. Since
AT's tracking program involves many cache operations, the frequent exchange of
data between the computer memory and the GPU makes GPU computing less
attractive when the number of particles is relatively small. Using a
multi-core CPU is therefore the better option. Open multi-processing (OpenMP)
is a standardized shared-memory programming model supported by C/C++
compilers. On a local quad-core computer, a speedup of 3.7 is achieved. Based
on the message-passing model, the message passing interface (MPI) is used on a
dual-CPU server to increase the speed further, and the computing speed grows
almost linearly with the number of computers.
II. EXPERIMENTAL
A. Accelerator physics and optimization analysis 


In AT, a particle is represented by a point in six-dimensional phase space,


$$X = (x,\, p_x,\, y,\, p_y,\, (p - p_0)/p_0,\, c\tau)^{T}, \qquad (1)$$


where $x$ and $y$ are the transverse coordinates, $p_x$ and $p_y$ are the
divergences in $x$ and $y$, $(p - p_0)/p_0$ is the relative momentum
deviation, and $c\tau$ is the longitudinal position. The evolution of the
phase space point through a magnetic accelerator element can be modelled
using the second-order transport matrix [5]


$$X_j^{\mathrm{fin}} = \Delta X_j + \sum_{k=1}^{6} R_{jk} X_k + \sum_{k=1}^{6} \sum_{l=1}^{6} T_{jkl} X_k X_l. \qquad (2)$$


Therefore, many multiply-reduce operations are required to propagate a
particle through a single magnetic element. The particles are assumed to be
sufficiently relativistic that inter-particle interactions can be ignored,
allowing the same operation to be applied to all particles in parallel [5].
Different accelerator elements have different transport matrices. In the AT
code, they are calculated in different 'passmethod' functions, which are
called millions of times during tracking by AT's 'ringpass' function to
calculate the particle trajectories. Propagating many particles through an
accelerator element is a SIMD (single instruction, multiple data) operation,
and is thus well suited to OpenMP and MPI processing [6].
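
To make the per-particle cost of Eq. (2) concrete, the following is a minimal C sketch of a second-order map applied to a bunch of independent particles. It is not taken from the AT source: the `SecondOrderMap` type and the function names are hypothetical, and the particle layout (n contiguous 6-vectors) is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NDIM 6 /* x, p_x, y, p_y, (p - p_0)/p_0, c*tau */

/* Hypothetical container for one element's map coefficients (not an AT type). */
typedef struct {
    double dX[NDIM];            /* constant offset, Delta X_j */
    double R[NDIM][NDIM];       /* first-order matrix R_jk    */
    double T[NDIM][NDIM][NDIM]; /* second-order tensor T_jkl  */
} SecondOrderMap;

/* out_j = dX_j + sum_k R_jk in_k + sum_k sum_l T_jkl in_k in_l */
static void apply_map(const SecondOrderMap *m, const double in[NDIM],
                      double out[NDIM])
{
    assert(in != out); /* the input must stay intact while out is filled */
    for (int j = 0; j < NDIM; j++) {
        double acc = m->dX[j];
        for (int k = 0; k < NDIM; k++) {
            acc += m->R[j][k] * in[k];
            for (int l = 0; l < NDIM; l++)
                acc += m->T[j][k][l] * in[k] * in[l];
        }
        out[j] = acc;
    }
}

/* Apply the same map to n particles stored as contiguous 6-vectors.
 * Each iteration touches only its own particle, so the loop iterations
 * are independent of each other. */
static void apply_map_bunch(const SecondOrderMap *m, double *r, size_t n)
{
    for (size_t c = 0; c < n; c++) {
        double tmp[NDIM];
        apply_map(m, r + c * NDIM, tmp);
        for (int j = 0; j < NDIM; j++)
            r[c * NDIM + j] = tmp[j];
    }
}
```

Each particle costs 6 x (6 + 36) multiply-accumulate steps per element in this form, and no iteration of the outer loop reads data written by another, which is the property the parallelization relies on.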

Parallel computing uses multiple computing resources to work simultaneously
on different parts of a problem [7]. It is an efficient way to speed up
computing. We focus mainly on making AT compatible with parallel processing
using OpenMP and MPI. OpenMP is an application programming interface (API)
for multi-threaded programming in C/C++ and Fortran. It offers a high-level
description of parallel computing and consists of a set of compiler
directives, library routines and environment variables that affect run-time
behavior [8]. By introducing OpenMP routines and directives into the existing
AT source code, we make AT follow a uniform memory access (UMA) model, in
which all processor cores share the same physical memory uniformly. This
requires moderate changes to the 'passmethod' functions, which are written in
C and can be compiled with OpenMP support through Matlab's MEX compiling
function [9].


MPI is the standard for the distributed-memory model, in which parallel
computing is controlled explicitly. MatlabMPI is a Matlab implementation of
the MPI standard and allows any Matlab program to exploit multiple processors
[10]. A non-uniform memory access (NUMA) architecture is built with OpenMP
and MPI. In this model, multiple machines run independently, each with its
own local memory, and communicate with each other through a bus interconnect.
Fig. 1 shows the NUMA model schematically.
Fig. 1. Schematics of non-uniform memory access (NUMA).


In this case, we use a general distributed-memory model. The upper computers
use MatlabMPI to control the parallel computing and communication, and the
lower computers use OpenMP to compute. With message-passing functions, the
data are distributed to the lower computers, and the results from the lower
computers are then transferred back to the upper computers and combined into
the final result. This pattern allows us to use the shared-memory interface
to manage local task distribution within every CPU, and makes AT function in
a global distributed-memory model. In this way, we avoid the data conflicts
caused by two processors trying to access the same memory. If only the UMA
model is used, such conflicts lead to a longer CPU spin time. In some cases,
the delay caused by data conflicts can make two processors take longer than
one.
B. Parallel AT with OpenMP and MPI


As mentioned above, speeding up the passmethod functions speeds up the whole
computation. OpenMP and MPI are used to parallelize the passmethod functions.
To create an OpenMP and MPI program from existing code, one finds the
sections of the code that can be processed simultaneously. The changes to the
code and the way to compile it are given in the appendix. The parallel part
of the function uses the library functions omp_get_num_threads() and
omp_get_thread_num() to get the number of threads in the parallel region and
the id of the working thread (threads are numbered from zero). The
start_index is the offset for each computing core. According to its thread
id, each thread works on a different part of the data. In this way, both the
data and the tasks are parallelized.
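
The index arithmetic described above can be sketched as follows. This is an illustrative reconstruction, not the AT code: `Range`, `thread_range` and `scale_in_parallel` are hypothetical names, and it is assumed that start_index offsets both ends of each thread's range. The `_OPENMP` guards let the same file build as a serial program when OpenMP is not enabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Inclusive index range [start, end] owned by one thread (hypothetical type). */
typedef struct { int start; int end; } Range;

/* Split n items, beginning at start_index, into num_threads contiguous chunks.
 * Assumption: the offset applies to both bounds. The last thread ends at
 * start_index + n - 1, so no item is left over. */
static Range thread_range(int start_index, int n, int thread_id, int num_threads)
{
    assert(n >= 0 && num_threads > 0);
    assert(thread_id >= 0 && thread_id < num_threads);
    Range r;
    r.start = start_index + (int)((long long)n * thread_id / num_threads);
    r.end   = start_index + (int)((long long)n * (thread_id + 1) / num_threads) - 1;
    return r;
}

/* Example worker: every thread scales only the slice it owns. */
static void scale_in_parallel(double *data, int n, double factor)
{
#pragma omp parallel
    {
#ifdef _OPENMP
        int tid = omp_get_thread_num();  /* id of this thread, from zero */
        int nth = omp_get_num_threads(); /* threads in this region       */
#else
        int tid = 0, nth = 1;            /* serial fallback              */
#endif
        Range r = thread_range(0, n, tid, nth);
        for (int i = r.start; i <= r.end; i++)
            data[i] *= factor;
    }
}
```

When n is smaller than the number of threads, some ranges come out empty (start greater than end) and the corresponding threads simply skip the loop.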

Take DriftPass.c [2], which calculates the state of the particles after they
pass through a drift element, as an example. Intel VTune Amplifier [11] is
used to test the efficiency of the parallel DriftPass.c. The number of
particles is 3920, with 500000 loops on a quad-core computer. The results are
given in Table 1, where the CPU time is the sum of the CPU time of all
threads, and the overhead & spin time is the time an active thread spends
waiting at a synchronization construct. These two make up most of the CPU
time. The overhead & spin time accounts for about 9.3% of the total CPU time,
which is the main reason the speedup of the parallel program cannot reach the
limit value of 4. The speedup is 2.98.


TABLE 1. Computation time for parallel and non-parallel DriftPass


Type of computing   Elapsed time (s)   CPU time (s)   Overhead & spin time (s)
Parallel            10.28              77.83          7.22
Non-parallel        30.09              30.05          0.00


OpenMP is used to parallelize all the 'passmethod' functions called in
'ringpass', all of which are parallelizable. Therefore, ringpass can be
treated as a parallel program.
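
As an illustration of how a pass method might look once parallelized, here is a minimal sketch of a drift element with an OpenMP work-sharing loop. It is written from the standard linearized drift map and is not copied from DriftPass.c; the names `drift6` and `drift_pass`, the particle layout and the sign convention of the sixth coordinate are assumptions, and the handling of lost particles is omitted.

```c
#include <assert.h>

/* Drift of length len (m) for one particle.
 * r = (x, p_x, y, p_y, delta, c*tau), with delta = (p - p_0)/p_0. */
static void drift6(double *r, double len)
{
    double p_norm   = 1.0 / (1.0 + r[4]);
    double norm_len = len * p_norm;
    r[0] += norm_len * r[1];
    r[2] += norm_len * r[3];
    r[5] += norm_len * p_norm * (r[1] * r[1] + r[3] * r[3]) / 2.0;
}

/* Track num_particles particles, stored as contiguous 6-vectors, through
 * the drift. The pragma splits the loop among the available threads; it
 * is ignored when the file is compiled without OpenMP support. */
static void drift_pass(double *r_in, double len, int num_particles)
{
    assert(r_in != 0 && num_particles >= 0);
#pragma omp parallel for
    for (int c = 0; c < num_particles; c++)
        drift6(r_in + 6 * c, len);
}
```

The work-sharing directive is used here instead of explicit index arithmetic because the loop body is uniform; the manual partition by thread id gives the same distribution with more control over the chunk boundaries.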


III. RESULTS AND DISCUSSION


Frequency map analysis (FMA) is a method to find the amplitude of frequency
shifts within a dynamic aperture. The program tracks the particles, records
their coordinates over N turns, and filters the data with a first-order
Hamming filter [12]. FMA is applied as a frequency scanning tool to reveal
information about nonlinear resonances and to guide frequency optimization
[13]. Particle tracking takes most of the computing time. OpenMP is used to
reprogram AT to shorten this time, which may save days or weeks. Figure 2
shows the FMA computing time of the parallel and non-parallel methods for
different numbers of particles, using an Intel i7-3770 CPU with 4 GB RAM.

The FMA execution time grows almost linearly with the number of particles.
The non-parallel method is up to 3.16 times slower than the parallel method.

Fig. 2. FMA execution time for non-parallel and parallel computing as a
function of the number of particles.


TABLE 2. Time profile (in seconds) using parallel and non-parallel computing


          Parallel computing             Non-parallel computing
N         T_f     T_all    T_f/T_all     T_f     T_all    T_f/T_all
512       1.19    5.87     0.202         1.25    17.52    0.071
1128      2.66    12.58    0.207         2.82    39.11    0.072
2450      5.75    28.04    0.205         6.16    85.92    0.071
4418      10.18   51.28    0.206         11.01   152.52   0.072
9800      22.47   109.74   0.207         24.40   339.43   0.072
177578    50.57   191.24   0.212         43.50   604.51   0.072


According to Amdahl's law [14],


$$S = \left[ f_{\mathrm{par}}/P + (1 - f_{\mathrm{par}}) \right]^{-1}, \qquad (3)$$


where $f_{\mathrm{par}}$ is the parallel fraction of the code, $P$ is the
speedup of the parallel part, and $S$ is the speedup of the whole program.
The profile command is used to obtain the time for the parallel and
non-parallel parts of the program flow. The parallel part is the 'ringpass'
function and the main non-parallel part is the FMA function. The remaining
parts take little of the total time. Table 2 shows the results for the
parallel and non-parallel programs, where $N$ is the number of particles,
$T_f$ is the time taken by the FMA function and $T_{\mathrm{all}}$ is the
time for the whole program.

From Table 2, the $T_f/T_{\mathrm{all}}$ ratio is stable for both types of
computing. According to Amdahl's law, the speedup of the parallel part can be
obtained from Eq. (4),


$$\left[ (1 - 0.07)/P + 0.07 \right]^{-1} = 3.16, \qquad (4)$$


[1] Terebilo A. Accelerator toolbox for MATLAB. SLAC-PUB-8732, 2001.
where 0.07 is the non-parallel fraction of the computing process. The speedup
of the parallel part is therefore P = 3.77; it never exceeds 4, even with a
larger number of particles. The CPU is quad-core, so at most 4 computing
threads execute at any one time, and synchronization between the threads
further reduces the computing speed.
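
The arithmetic behind Eq. (4) can be checked with a few lines of C. The sketch below evaluates Eq. (3) and inverts it for P; the function names are hypothetical and the inputs are the values quoted in the text (parallel fraction 0.93, whole-program speedup 3.16).

```c
#include <assert.h>

/* Eq. (3): speedup S of the whole program, given the parallel fraction
 * f_par and the speedup p of the parallel part. */
static double amdahl_speedup(double f_par, double p)
{
    assert(f_par >= 0.0 && f_par <= 1.0 && p > 0.0);
    return 1.0 / (f_par / p + (1.0 - f_par));
}

/* Eq. (3) solved for P: P = f_par / (1/S - (1 - f_par)).
 * The denominator must be positive, i.e. S must stay below the
 * serial-fraction limit 1 / (1 - f_par). */
static double parallel_part_speedup(double f_par, double s)
{
    double denom = 1.0 / s - (1.0 - f_par);
    assert(f_par > 0.0 && s > 0.0 && denom > 0.0);
    return f_par / denom;
}
```

Calling `parallel_part_speedup(0.93, 3.16)` gives P of about 3.77, in agreement with the value above. For comparison, an ideal P = 4 on the same code would give a whole-program speedup of about 3.31, so the measured 3.16 is already close to what four cores allow with a 7% serial part.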

To take advantage of the speedup offered by OpenMP and MPI, a Dell R720
server is used for particle tracking in the slow extraction of the Shanghai
Proton Therapy Synchrotron. The R720 has 2 processors with 16 CPU cores each,
and acts as two compute nodes in the computation. The speedup with one node
is 6.23. The speedup with 2 nodes is 12.18, almost double that of one node.
Since the computations on the nodes are independent of each other, and
communication takes only about 2.3% of the total running time, the two-node
speedup is expected to be almost double the one-node speedup. It can be
estimated that the speedup of the computing process grows almost linearly
with the number of CPUs using OpenMP and MPI.
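
As a consistency check on these figures, the two-node scaling can be written out explicitly. The symbols $S_1$, $S_2$ and $E$ are introduced here for illustration only and do not appear in the original analysis.

```latex
\frac{S_2}{S_1} = \frac{12.18}{6.23} \approx 1.955,
\qquad
E = \frac{S_2}{2\,S_1} \approx 0.978 .
```

A scaling efficiency of roughly 97.8% is in line with the quoted communication share of about 2.3% of the running time, since $1 - 0.023 = 0.977$.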


IV. CONCLUSION 


In this paper, we have described how AT works and how OpenMP and MPI can be
used to parallelize it. The parallelized AT computes faster. Provided the
code can be parallelized, OpenMP and MPI can be used in a similar way for
other accelerator physics programs. This pattern is convenient to use, and
the speedup is close to the limit of what can be achieved by a single
computer or a cluster. With more compute nodes, larger problems can be
solved.


APPENDIX 


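#include <omp.h>

/* ... some computation and initialization ... */

omp_set_num_threads(4);
#pragma omp parallel private(i) shared(start_index, n)
{
    /* Declared inside the parallel region, so each thread has its own copy. */
    int thread_id   = omp_get_thread_num();
    int num_threads = omp_get_num_threads();
    int start, end;

    /* The offset start_index is applied to both bounds of the range. */
    start = start_index + n * thread_id / num_threads;
    if (thread_id == num_threads - 1)
        end = start_index + n - 1;
    else
        end = start_index + n * (thread_id + 1) / num_threads - 1;

    for (i = start; i <= end; i++) {
        /* ... computation ... */
    }
}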